This analysis is based on the premise that widening political divisions are not just about “what to do” but, more fundamentally, about perceptions of “what the priorities are.”
Perceptions of priorities are influenced, in part, by how frequently we hear about an issue. In turn, those who speak about issues we care about tend to attract our attention. It’s the nature of this positive feedback loop that is the subject of exploration here: how can we detect, using statistics, systematic differences in language that reflect priorities and biases?
Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the texts of recent Republican and Democratic presidential debates. Key findings are:
1. “Wordcloud” visualizations reveal differences between candidates, though the similarities are just as surprising.
2. Frequency analysis of keywords highlights strong differences between candidates, but misses important context.
3. Bigram tokenization and word-stem searches begin to reveal subtleties of meaning.
The texts of the presidential debates were downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point, all processing is done in R using the {tm} package and associated libraries.
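As a sketch of that pipeline (the file path below is illustrative, not the actual one used), reading a transcript into a {tm} corpus and applying the standard cleaning steps might look like:

```r
library(tm)

# Read a candidate's transcript from an unformatted .txt file
# (the path is a placeholder for illustration)
raw <- readLines("transcripts/trump_all.txt", warn = FALSE)

# Build a one-document corpus and apply the usual cleaning steps
corpus <- Corpus(VectorSource(paste(raw, collapse = " ")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```

Lowercasing and punctuation removal come before stopword removal so that words like “The” match the stopword list.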
The quickest and most visual method to compare texts is word-frequency analysis using wordclouds. Not surprisingly, word choices vary between candidates. However, there are also some striking similarities.
Let’s first compare the word clouds of candidates using the {wordcloud} package.
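The calls below use a small helper, c_wordcloud, whose definition isn’t shown in the text. A minimal reconstruction, assuming it receives an already-cleaned {tm} corpus, might be:

```r
library(tm)
library(wordcloud)
library(RColorBrewer)

# Hypothetical reconstruction of the c_wordcloud helper:
# count terms in the corpus and draw a frequency-scaled cloud
c_wordcloud <- function(corpus, max_words = 100) {
  tdm   <- TermDocumentMatrix(corpus)
  freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freqs), freqs, max.words = max_words,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}
```

Because the word size scales with raw frequency, a candidate who spoke more words produces a visibly larger cloud, which matters for the comparisons below.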
Differences between Donald Trump’s and Bernie Sanders’s dialogue at the debates are evident. Bernie’s word cloud is larger because he spoke more total words. However, despite the differences, what is most surprising is the similarity of the clouds; word choices like people, country, and going are common to both. Despite strong differences in policy, these word clouds reveal little about them.
c_wordcloud(trump_all)
c_wordcloud(sanders_all)
In this case the word clouds couldn’t be more different. Hillary’s emphasizes think and people, while Carly’s, a former businesswoman, primarily emphasizes government. However, there is no context to judge, for instance, what Ms. Fiorina’s opinions or sentiments toward government are. Note again that Hillary’s wordcloud is larger than Ms. Fiorina’s.
c_wordcloud(clinton_all)
c_wordcloud(fiorina_all)
Ted Cruz’s wordcloud seems to emphasize wonkish financial technicalities, like taxes and washington, while that of Mike Huckabee, a former minister, mixes the language of Mr. Trump and Ms. Fiorina. Again, no sentiment can be extracted.
c_wordcloud(cruz_all)
c_wordcloud(huckabee_all)
We can also split the text by debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how comparable the language of each candidate is between the debates. Perhaps the candidates are more interested in staying on message than answering questions directly?
c_wordcloud(candidate_text_tc("TRUMP", r_oct))
c_wordcloud(candidate_text_tc("TRUMP", r_nov))
c_wordcloud(candidate_text_tc("SANDERS", d_oct))
c_wordcloud(candidate_text_tc("SANDERS", d_nov))
We can check word frequency directly by simply tokenizing the text and counting single words. In a sense this is equivalent to the wordcloud analysis, but it is more quantitative. For this analysis, some additional words like “thats”, “dont”, “back”, “can”, “get”, “cant”, and “come” are suppressed.
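The tokenize-and-count step with the extra suppression list can be sketched in base R (the sample sentence is made up for illustration):

```r
# Extra filler words suppressed on top of the standard stopword list
extra_stop <- c("thats", "dont", "back", "can", "get", "cant", "come")

# Tokenize a text into lowercase words and count them,
# dropping punctuation and the suppressed words
count_words <- function(text) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
  words <- words[nchar(words) > 0 & !(words %in% extra_stop)]
  sort(table(words), decreasing = TRUE)
}

count_words("People, people of this country: can we get going?")
```

Here “can” and “get” are dropped by the suppression list, while “people” is counted twice.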
This table shows the most frequent words used by each candidate.
| word | trump | sanders | clinton | fiorina | sum |
|---|---|---|---|---|---|
| think | 9 | 55 | 90 | 9 | 163 |
| know | 23 | 26 | 56 | 19 | 124 |
| well | 9 | 31 | 56 | 8 | 104 |
| people | 33 | 85 | 53 | 10 | 181 |
| government | 0 | 7 | 6 | 40 | 53 |
| every | 4 | 15 | 9 | 26 | 54 |
| need | 5 | 33 | 36 | 18 | 92 |
| country | 34 | 70 | 25 | 1 | 130 |
| going | 44 | 44 | 45 | 10 | 143 |
Word counts differ widely, reflecting the vocabulary choices made by each candidate. It’s also apparent that the number of words spoken by each candidate differed greatly, due to the larger number of Republican candidates (roughly ten) than Democratic candidates (roughly three). Indeed, Carly Fiorina spoke 1580 total words with a vocabulary of 702 distinct words, while Bernie Sanders spoke 4314 total words with a vocabulary of 1375.
From the above, there may be information in comparing how frequently different candidates use the same words. Here is a graph of the “top” words used by all candidates. Because total word counts differ so much between candidates, we need to normalize them: \(\nu_{i} = W_{i} / \sum_{k=1}^{N} W_{k}\), where \(\nu_{i}\) is the normalized frequency of word \(i\) with count \(W_{i}\).
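The normalization is a one-liner; here is a minimal sketch with made-up counts (not the actual debate numbers):

```r
# nu_i = W_i / sum_k W_k: convert raw word counts to frequencies
normalize_freq <- function(counts) counts / sum(counts)

# Illustrative counts only
counts <- c(government = 40, people = 10, tax = 17)
nu <- normalize_freq(counts)
round(nu, 3)
```

The resulting frequencies sum to one per candidate, so candidates who spoke very different numbers of words can be compared directly.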
In the graph below the \(\nu_{i}\) for each candidate are plotted for the most-used words as measured for the ensemble of all candidates.
This starts to be much more informative. For instance, Carly Fiorina mentions the word “government” as almost two percent of her word usage, whereas Donald Trump hardly mentions the word at all. Or notice that both Bernie Sanders and Donald Trump mention the word “wall” more than their competitors while Bernie Sanders alone mentions the word “street” with comparably high frequency.
The above doesn’t reveal much more than the wordcloud analysis does. However, we can also pick some “key words” and sample their frequencies. For a first stab, let’s try
key_words = c("tax", "government", "climate", "class", "wall", "street","terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence")
## Row.names trump sanders clinton fiorina all rank
## 1062 government 0.0000000000 0.001622624 0.0012992638 0.025316456 53 1
## 2671 wall 0.0065281899 0.006722299 0.0023819835 0.001265823 53 2
## 2447 tax 0.0071216617 0.002549838 0.0008661758 0.010759494 44 3
## 2377 street 0.0005934718 0.006490496 0.0025985275 0.001265823 43 4
## 1605 money 0.0065281899 0.004404265 0.0006496319 0.004430380 40 5
## 1128 health 0.0000000000 0.004172462 0.0030316154 0.001898734 35 6
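A sketch of how such a table can be assembled, using a small illustrative count matrix rather than the real debate data:

```r
key_words <- c("tax", "government", "climate", "wall", "street", "money")

# Illustrative raw counts (rows = words, columns = candidates);
# these are NOT the actual debate numbers
freq_matrix <- matrix(c(12,  5,  2, 17,
                         0,  7,  6, 40,
                        11, 29,  5,  2),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("tax", "government", "wall"),
                                      c("trump", "sanders", "clinton", "fiorina")))

# Keep only the key words, then rank by total count across all candidates
key_rows <- freq_matrix[rownames(freq_matrix) %in% key_words, , drop = FALSE]
totals   <- rowSums(key_rows)
cbind(key_rows, all = totals, rank = rank(-totals))[order(-totals), ]
```

In the actual table the candidate columns hold normalized frequencies \(\nu_{i}\) while the `all` column holds raw totals; the sketch above ranks on totals only, to keep it short.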
Since word frequency does not convey specific positions on issues, let’s look at word associations to see if we can get closer to meaning from the context of word usage. This analysis simply tokenizes the text as bigrams, then uses a simple lookup
bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE), ]
to pull out relevant terms from the tokenized TDM. A key challenge is that the texts are relatively short, so the statistics comparing word frequencies are poor. Nevertheless, the context around different words, even at the relatively unsophisticated level of simple bigrams, starts to hint at differences in approach to problems.
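A minimal base-R stand-in for the bigram tokenizer and the lookup (here bigram_table is a named count vector rather than a TDM, so names() takes the place of rownames(); the sample sentence is made up):

```r
# Tokenize a text into bigrams (adjacent word pairs) and count them
bigrams <- function(text) {
  w <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
  w <- w[nchar(w) > 0]
  sort(table(paste(w[-length(w)], w[-1])), decreasing = TRUE)
}

bigram_table <- bigrams("We must reform the tax code and simplify the tax plan")

# Pull out the bigrams containing a key word, as in the text
bigram_table[grep("tax", names(bigram_table), ignore.case = TRUE)]
```

Matching on the key word’s substring also catches stem variants (“taxes”, “taxation”), which is what makes word-stem searches cheap with this lookup.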
Bernie talks about “tax” and “terror” as well. His discussion of taxes has a reformist bent, but where Carly Fiorina associates words like budgeting, changes, reform, simplify, code, and plan, Bernie Sanders associates words like cap, income, must, share, speculation, breaks, reform, wall, and rebuilding.